Brief description of the data set and a summary of its attributes
Initial plan for data exploration
Actions taken for data cleaning and feature engineering
Key Findings and Insights, which synthesizes the results of Exploratory Data Analysis in an insightful and actionable manner
Formulating at least 3 hypothesis about this data
Conducting a formal significance test for one of the hypotheses and discuss the results
Suggestions for next steps in analyzing this data
A paragraph that summarizes the quality of this data set and a request for additional data if needed
%pylab inline
%config InlineBackend.figure_formats = ['retina']
import pandas as pd
import seaborn as sns
sns.set()
Populating the interactive namespace from numpy and matplotlib
## Load the Ames Housing Data
datafile = "data/Ames_Housing_Data.tsv"
df = pd.read_csv(datafile, sep='\t')
# A quick look at the data:
df.head()
| Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 93.0 | 11160 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | 189900 |
5 rows × 82 columns
## Examine the columns, look at missing data
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2930 entries, 0 to 2929 Data columns (total 82 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order 2930 non-null int64 1 PID 2930 non-null int64 2 MS SubClass 2930 non-null int64 3 MS Zoning 2930 non-null object 4 Lot Frontage 2440 non-null float64 5 Lot Area 2930 non-null int64 6 Street 2930 non-null object 7 Alley 198 non-null object 8 Lot Shape 2930 non-null object 9 Land Contour 2930 non-null object 10 Utilities 2930 non-null object 11 Lot Config 2930 non-null object 12 Land Slope 2930 non-null object 13 Neighborhood 2930 non-null object 14 Condition 1 2930 non-null object 15 Condition 2 2930 non-null object 16 Bldg Type 2930 non-null object 17 House Style 2930 non-null object 18 Overall Qual 2930 non-null int64 19 Overall Cond 2930 non-null int64 20 Year Built 2930 non-null int64 21 Year Remod/Add 2930 non-null int64 22 Roof Style 2930 non-null object 23 Roof Matl 2930 non-null object 24 Exterior 1st 2930 non-null object 25 Exterior 2nd 2930 non-null object 26 Mas Vnr Type 2907 non-null object 27 Mas Vnr Area 2907 non-null float64 28 Exter Qual 2930 non-null object 29 Exter Cond 2930 non-null object 30 Foundation 2930 non-null object 31 Bsmt Qual 2850 non-null object 32 Bsmt Cond 2850 non-null object 33 Bsmt Exposure 2847 non-null object 34 BsmtFin Type 1 2850 non-null object 35 BsmtFin SF 1 2929 non-null float64 36 BsmtFin Type 2 2849 non-null object 37 BsmtFin SF 2 2929 non-null float64 38 Bsmt Unf SF 2929 non-null float64 39 Total Bsmt SF 2929 non-null float64 40 Heating 2930 non-null object 41 Heating QC 2930 non-null object 42 Central Air 2930 non-null object 43 Electrical 2929 non-null object 44 1st Flr SF 2930 non-null int64 45 2nd Flr SF 2930 non-null int64 46 Low Qual Fin SF 2930 non-null int64 47 Gr Liv Area 2930 non-null int64 48 Bsmt Full Bath 2928 non-null float64 49 Bsmt Half Bath 2928 non-null float64 50 Full Bath 2930 non-null int64 51 Half Bath 2930 non-null int64 52 Bedroom AbvGr 2930 non-null int64 53 Kitchen AbvGr 2930 non-null int64 54 Kitchen Qual 2930 non-null object 55 TotRms AbvGrd 2930 non-null int64 56 Functional 2930 non-null object 57 Fireplaces 2930 non-null int64 58 Fireplace Qu 1508 non-null object 59 Garage Type 2773 non-null object 60 Garage Yr Blt 2771 non-null float64 61 Garage Finish 2771 non-null object 62 Garage Cars 2929 non-null float64 63 Garage Area 2929 non-null float64 64 Garage Qual 2771 non-null object 65 Garage Cond 2771 non-null object 66 Paved Drive 2930 non-null object 67 Wood Deck SF 2930 non-null int64 68 Open Porch SF 2930 non-null int64 69 Enclosed Porch 2930 non-null int64 70 3Ssn Porch 2930 non-null int64 71 Screen Porch 2930 non-null int64 72 Pool Area 2930 non-null int64 73 Pool QC 13 non-null object 74 Fence 572 non-null object 75 Misc Feature 106 non-null object 76 Misc Val 2930 non-null int64 77 Mo Sold 2930 non-null int64 78 Yr Sold 2930 non-null int64 79 Sale Type 2930 non-null object 80 Sale Condition 2930 non-null object 81 SalePrice 2930 non-null int64 dtypes: float64(11), int64(28), object(43) memory usage: 1.8+ MB
df.describe()
| Order | PID | MS SubClass | Lot Frontage | Lot Area | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Mas Vnr Area | ... | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | Mo Sold | Yr Sold | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2930.00000 | 2.930000e+03 | 2930.000000 | 2440.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2907.000000 | ... | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 |
| mean | 1465.50000 | 7.144645e+08 | 57.387372 | 69.224590 | 10147.921843 | 6.094881 | 5.563140 | 1971.356314 | 1984.266553 | 101.896801 | ... | 93.751877 | 47.533447 | 23.011604 | 2.592491 | 16.002048 | 2.243345 | 50.635154 | 6.216041 | 2007.790444 | 180796.060068 |
| std | 845.96247 | 1.887308e+08 | 42.638025 | 23.365335 | 7880.017759 | 1.411026 | 1.111537 | 30.245361 | 20.860286 | 179.112611 | ... | 126.361562 | 67.483400 | 64.139059 | 25.141331 | 56.087370 | 35.597181 | 566.344288 | 2.714492 | 1.316613 | 79886.692357 |
| min | 1.00000 | 5.263011e+08 | 20.000000 | 21.000000 | 1300.000000 | 1.000000 | 1.000000 | 1872.000000 | 1950.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2006.000000 | 12789.000000 |
| 25% | 733.25000 | 5.284770e+08 | 20.000000 | 58.000000 | 7440.250000 | 5.000000 | 5.000000 | 1954.000000 | 1965.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.000000 | 2007.000000 | 129500.000000 |
| 50% | 1465.50000 | 5.354536e+08 | 50.000000 | 68.000000 | 9436.500000 | 6.000000 | 5.000000 | 1973.000000 | 1993.000000 | 0.000000 | ... | 0.000000 | 27.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 2008.000000 | 160000.000000 |
| 75% | 2197.75000 | 9.071811e+08 | 70.000000 | 80.000000 | 11555.250000 | 7.000000 | 6.000000 | 2001.000000 | 2004.000000 | 164.000000 | ... | 168.000000 | 70.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 2009.000000 | 213500.000000 |
| max | 2930.00000 | 1.007100e+09 | 190.000000 | 313.000000 | 215245.000000 | 10.000000 | 9.000000 | 2010.000000 | 2010.000000 | 1600.000000 | ... | 1424.000000 | 742.000000 | 1012.000000 | 508.000000 | 576.000000 | 800.000000 | 17000.000000 | 12.000000 | 2010.000000 | 755000.000000 |
8 rows × 39 columns
Ames Housing Data is related to the sale price of houses depending on the facilities of the house mainly the lot area, year built. We will need to select the important features by analyzing the missing values of each sector.
# There are a *lot* of variables. Let's look at how many values are missing for each variable.
df.isnull().sum().sort_values()
Order 0
Sale Condition 0
Heating QC 0
Central Air 0
1st Flr SF 0
...
Fireplace Qu 1422
Fence 2358
Alley 2732
Misc Feature 2824
Pool QC 2917
Length: 82, dtype: int64
Initial plan for data exploration: Let's pick out just a few numeric columns to illustrate basic feature transformations.
smaller_df= df.loc[:,['Lot Area', 'Overall Qual', 'Overall Cond',
'Year Built', 'Year Remod/Add', 'Gr Liv Area',
'Full Bath', 'Bedroom AbvGr', 'Fireplaces',
'Garage Cars','SalePrice']]
# Now we can look at summary statistics of the subset data
smaller_df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Lot Area | 2930.0 | 10147.921843 | 7880.017759 | 1300.0 | 7440.25 | 9436.5 | 11555.25 | 215245.0 |
| Overall Qual | 2930.0 | 6.094881 | 1.411026 | 1.0 | 5.00 | 6.0 | 7.00 | 10.0 |
| Overall Cond | 2930.0 | 5.563140 | 1.111537 | 1.0 | 5.00 | 5.0 | 6.00 | 9.0 |
| Year Built | 2930.0 | 1971.356314 | 30.245361 | 1872.0 | 1954.00 | 1973.0 | 2001.00 | 2010.0 |
| Year Remod/Add | 2930.0 | 1984.266553 | 20.860286 | 1950.0 | 1965.00 | 1993.0 | 2004.00 | 2010.0 |
| Gr Liv Area | 2930.0 | 1499.690444 | 505.508887 | 334.0 | 1126.00 | 1442.0 | 1742.75 | 5642.0 |
| Full Bath | 2930.0 | 1.566553 | 0.552941 | 0.0 | 1.00 | 2.0 | 2.00 | 4.0 |
| Bedroom AbvGr | 2930.0 | 2.854266 | 0.827731 | 0.0 | 2.00 | 3.0 | 3.00 | 8.0 |
| Fireplaces | 2930.0 | 0.599317 | 0.647921 | 0.0 | 0.00 | 1.0 | 1.00 | 4.0 |
| Garage Cars | 2929.0 | 1.766815 | 0.760566 | 0.0 | 1.00 | 2.0 | 2.00 | 5.0 |
| SalePrice | 2930.0 | 180796.060068 | 79886.692357 | 12789.0 | 129500.00 | 160000.0 | 213500.00 | 755000.0 |
smaller_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2930 entries, 0 to 2929 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Lot Area 2930 non-null int64 1 Overall Qual 2930 non-null int64 2 Overall Cond 2930 non-null int64 3 Year Built 2930 non-null int64 4 Year Remod/Add 2930 non-null int64 5 Gr Liv Area 2930 non-null int64 6 Full Bath 2930 non-null int64 7 Bedroom AbvGr 2930 non-null int64 8 Fireplaces 2930 non-null int64 9 Garage Cars 2929 non-null float64 10 SalePrice 2930 non-null int64 dtypes: float64(1), int64(10) memory usage: 251.9 KB
ax = sns.boxplot(x="Lot Area", data = smaller_df)
ax = sns.boxplot(x="Gr Liv Area", data = smaller_df)
From the above boxplot, it is observed that there are some outliers in lot area and Gr Liv area. We need to take of those. Actions taken for data cleaning and feature engineering: We need to filter data above 4000 of Gr Liv Area.
smaller_df = smaller_df[smaller_df["Gr Liv Area"]<=4000]
ax = sns.boxplot(x="Gr Liv Area", data = smaller_df)
smaller_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2925 entries, 0 to 2929 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Lot Area 2925 non-null int64 1 Overall Qual 2925 non-null int64 2 Overall Cond 2925 non-null int64 3 Year Built 2925 non-null int64 4 Year Remod/Add 2925 non-null int64 5 Gr Liv Area 2925 non-null int64 6 Full Bath 2925 non-null int64 7 Bedroom AbvGr 2925 non-null int64 8 Fireplaces 2925 non-null int64 9 Garage Cars 2924 non-null float64 10 SalePrice 2925 non-null int64 dtypes: float64(1), int64(10) memory usage: 274.2 KB
# There appears to be one NA in Garage Cars - we will take a simple approach and fill it with 0
smaller_df = smaller_df.fillna(0)
smaller_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2925 entries, 0 to 2929 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Lot Area 2925 non-null int64 1 Overall Qual 2925 non-null int64 2 Overall Cond 2925 non-null int64 3 Year Built 2925 non-null int64 4 Year Remod/Add 2925 non-null int64 5 Gr Liv Area 2925 non-null int64 6 Full Bath 2925 non-null int64 7 Bedroom AbvGr 2925 non-null int64 8 Fireplaces 2925 non-null int64 9 Garage Cars 2925 non-null float64 10 SalePrice 2925 non-null int64 dtypes: float64(1), int64(10) memory usage: 274.2 KB
smaller_df.head()
| Lot Area | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Gr Liv Area | Full Bath | Bedroom AbvGr | Fireplaces | Garage Cars | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31770 | 6 | 5 | 1960 | 1960 | 1656 | 1 | 3 | 2 | 2.0 | 215000 |
| 1 | 11622 | 5 | 6 | 1961 | 1961 | 896 | 1 | 2 | 0 | 1.0 | 105000 |
| 2 | 14267 | 6 | 6 | 1958 | 1958 | 1329 | 1 | 3 | 0 | 1.0 | 172000 |
| 3 | 11160 | 7 | 5 | 1968 | 1968 | 2110 | 2 | 3 | 2 | 2.0 | 244000 |
| 4 | 13830 | 5 | 5 | 1997 | 1998 | 1629 | 2 | 3 | 1 | 2.0 | 189900 |
Now that we have a nice, filtered dataset, let's generate visuals to better understand the target and feature-target relationships: pairplot is great for this!
sns.pairplot(smaller_df, plot_kws=dict(alpha=.1, edgecolor='none'))
<seaborn.axisgrid.PairGrid at 0x2acc24e0400>
Key Findings and Insights, which synthesizes the results of Exploratory Data Analysis in an insightful and actionable manner: From the above pair plot, it is observed that Overall Qual, Gr Liv Area are related to SalePrice.
Formulating at least 3 hypothesis about this data:
smaller_df["Overall Cond"].mean()
5.563760683760684
smaller_df["Overall Qual"].mean()
6.088205128205129
Hypthesis1: Null Hypothesis - there is no difference between Overall Cond and Overall Qual Alternative Hypothesis - there is difference between Overall Cond and Overall Qual
Conducting a formal significance test for Hypothese1 and discuss the results. We will obtain our statistics, t-value and p-value. We will use scipy.stats library and ttest_ind() function to calculate these parameters.
import scipy.stats as stats
alpha=0.05
t_value, p_value = stats.ttest_ind(smaller_df["Overall Cond"], smaller_df["Overall Qual"])
print("t_value = ",t_value, ", p_value = ", p_value)
t_value = -15.84240193664777 , p_value = 2.2213205979244325e-55
if p_value <alpha:
print("Conclusion: since p_value {} is less than alpha {} ". format (p_value,alpha))
print("Reject the null hypothesis that there is no difference between Overall Cond and Overall Qual.")
else:
print("Conclusion: since p_value {} is greater than alpha {} ". format (p_value,alpha))
print("Fail to reject the null hypothesis that there is a difference between Overall Cond and Overall Qual.")
Conclusion: since p_value 2.2213205979244325e-55 is less than alpha 0.05 Reject the null hypothesis that there is no difference between Overall Cond and Overall Qual.
smaller_df["Gr Liv Area"].mean()
1493.9788034188034
smaller_df["Lot Area"].mean()
10103.58358974359
Hypthesis2: Null Hypothesis - there is no difference between Lot Area and Gr Liv Area Alternative Hypothesis - there is difference between Lot Area and Gr Liv Area
alpha=0.05
t_value, p_value = stats.ttest_ind(smaller_df["Lot Area"], smaller_df["Gr Liv Area"])
print("t_value = ",t_value, ", p_value = ", p_value)
t_value = 59.71848020016229 , p_value = 0.0
if p_value <alpha:
print("Conclusion: since p_value {} is less than alpha {} ". format (p_value,alpha))
print("Reject the null hypothesis that there is no difference between Lot Area and Gr Liv Area.")
else:
print("Conclusion: since p_value {} is greater than alpha {} ". format (p_value,alpha))
print("Fail to reject the null hypothesis that there is a difference between Lot Area and Gr Liv Area.")
Conclusion: since p_value 0.0 is less than alpha 0.05 Reject the null hypothesis that there is no difference between Lot Area and Gr Liv Area.
smaller_df["Full Bath"].mean()
1.5647863247863247
smaller_df["Garage Cars"].mean()
1.7644444444444445
Hypthesis3: Null Hypothesis - there is no difference between Full Bath and Garage Cars Alternative Hypothesis - there is difference between Full Bath and Garage Cars
alpha=0.05
t_value, p_value = stats.ttest_ind(smaller_df["Full Bath"], smaller_df["Garage Cars"])
print("t_value = ",t_value, ", p_value = ", p_value)
t_value = -11.496246853314082 , p_value = 2.9073903122766963e-30
if p_value <alpha:
print("Conclusion: since p_value {} is less than alpha {} ". format (p_value,alpha))
print("Reject the null hypothesis that there is no difference between Full Bath and Garage Cars.")
else:
print("Conclusion: since p_value {} is greater than alpha {} ". format (p_value,alpha))
print("Fail to reject the null hypothesis that there is a difference between Full Bath and Garage Cars.")
Conclusion: since p_value 2.9073903122766963e-30 is less than alpha 0.05 Reject the null hypothesis that there is no difference between Full Bath and Garage Cars.
Suggestions for next steps in analyzing this data: We can check the correlation among the features and target variable. We can then focus on the more related features.
features = smaller_df.corr()['SalePrice'].sort_values()
features
Overall Cond -0.103296 Bedroom AbvGr 0.139766 Lot Area 0.270047 Fireplaces 0.474878 Year Remod/Add 0.540290 Full Bath 0.544486 Year Built 0.565110 Garage Cars 0.652227 Gr Liv Area 0.719463 Overall Qual 0.805236 SalePrice 1.000000 Name: SalePrice, dtype: float64
features.plot(kind='bar',figsize=(10,8))
<AxesSubplot:>
From the graph above, we can deduct some of the highly correlated features and select only those ones for any future analysis.
ax = sns.barplot(x="Overall Qual", y="SalePrice", data=smaller_df)
sns.scatterplot(data=smaller_df, x="Gr Liv Area", y="SalePrice")
<AxesSubplot:xlabel='Gr Liv Area', ylabel='SalePrice'>
sns.scatterplot(data=smaller_df, x="Garage Cars", y="SalePrice")
<AxesSubplot:xlabel='Garage Cars', ylabel='SalePrice'>
A paragraph that summarizes the quality of this data set and a request for additional data if needed: From the above observations, we have seen that only two features (Overall Qual, Gr Liv Area) are more related to the target variable SalePrice. It would be great if we could get additional data which are more related to the price to build a machine learning model to predeict the price more accurately.